Pruning filters for efficient convolutional neural networks for image recognition in surveillance applications

ABSTRACT

Systems and methods for pruning a convolutional neural network (CNN) for surveillance with image recognition are described, including extracting convolutional layers from a trained CNN, each convolutional layer including a kernel matrix having at least one filter formed in a corresponding output channel of the kernel matrix, and a feature map set having a feature map corresponding to each filter. An absolute kernel weight is determined for each kernel and summed across each filter to determine a magnitude of each filter. The magnitude of each filter is compared with a threshold, and filters whose magnitudes are below the threshold are removed. A feature map corresponding to each of the removed filters is removed to prune the CNN of filters. The CNN is retrained to generate a pruned CNN having fewer convolutional layers to efficiently recognize and predict conditions in an environment being surveilled.

RELATED APPLICATION INFORMATION

This application claims priority to 62/506,657, filed on May 16, 2017, incorporated herein by reference in its entirety. This application is related to an application entitled "PRUNING FILTERS FOR EFFICIENT CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE RECOGNITION OF ENVIRONMENTAL HAZARDS", having attorney docket number 16085B, and an application entitled "PRUNING FILTERS FOR EFFICIENT CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE RECOGNITION IN VEHICLES", having attorney docket number 16085C, both of which are incorporated by reference herein in their entirety.

BACKGROUND

Technical Field

The present invention relates to image recognition with neural networks, and more particularly to pruning filters for efficient convolutional neural networks for image recognition in surveillance applications.

Description of the Related Art

Convolutional neural networks (CNNs) can be used to provide image recognition. As image recognition efforts have become more sophisticated, so have CNNs for image recognition, using deeper and deeper networks with more parameters and convolutions. However, this trend also increases the CNN's demand for computational and power resources. Thus, image recognition with CNNs is impractical, and indeed, in some instances, impossible, in embedded and mobile applications. Even applications that have the power and computational resources for accurate CNNs would benefit from more efficient image recognition. Simply compressing or pruning the weights of layers of a neural network would not adequately reduce the costs of a deep neural network.

SUMMARY

According to an aspect of the present principles, a method is provided for pruning a convolutional neural network (CNN) for surveillance with image recognition. The method includes extracting at least one convolutional layer from a trained CNN, each convolutional layer including a kernel matrix having at least one filter formed in a corresponding output channel of the kernel matrix, and a feature map set having a feature map corresponding to each of the at least one filter. An absolute kernel weight is determined for each kernel in the kernel matrix. The absolute kernel weights of each kernel in each of the at least one filter are summed to determine a magnitude of each filter. The magnitude of each filter is compared with a threshold, and one or more filters that are below the threshold are removed. A feature map corresponding to each of the removed filters is removed to prune the CNN of filters. The CNN is retrained upon pruning the removed filters to generate a pruned CNN having fewer convolutional layers to efficiently recognize and predict conditions in an environment being surveilled.

According to an aspect of the present principles, a non-transitory computer readable storage medium comprising a computer readable program for surveillance with image recognition using a pruned convolutional neural network (CNN) is described. The computer readable program, when executed on a computer, causes the computer to perform steps including extracting at least one convolutional layer from a trained CNN, each convolutional layer including a kernel matrix having at least one filter formed in a corresponding output channel of the kernel matrix, and a feature map set having a feature map corresponding to each of the at least one filter. An absolute kernel weight is determined for each kernel in the kernel matrix. The absolute kernel weights of each kernel in each of the at least one filter are summed to determine a magnitude of each filter. The magnitude of each filter is compared with a threshold, and one or more filters that are below the threshold are removed. A feature map corresponding to each of the removed filters is removed to prune the CNN of filters. The CNN is retrained upon pruning the removed filters to generate a pruned CNN having fewer convolutional layers to efficiently recognize and predict conditions in an environment being surveilled.

According to another aspect of the present principles, a system is provided for image recognition for surveillance. The system includes an image capture device for capturing images of an environment to be surveilled. An image recognition system is included in an embedded computing device of the image capture device and is configured to perform image recognition with a pruned CNN. The image recognition system includes an absolute kernel weight summer configured to determine an absolute kernel weight for each kernel in a kernel matrix and sum the absolute kernel weights of each kernel in a filter corresponding to each of at least one output channel of the kernel matrix to determine a magnitude of each filter, each filter corresponding to a feature map. The image recognition system further includes a threshold comparison unit configured to compare the magnitude of each filter with a threshold and remove one or more filters that are below the threshold, and a layer updater configured to remove a feature map corresponding to each of the removed filters to prune the CNN of filters and generate the pruned CNN. A long short-term memory network (LSTM) is included for predicting feature actions. An action network is included for generating class probabilities of feature actions. A notification device is included for notifying a user of the class probabilities.

These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:

FIG. 1 is a block/flow diagram illustrating a high-level system/method for surveillance with image recognition using a pruned convolutional neural network (CNN), in accordance with the present principles;

FIG. 2 is a block/flow diagram illustrating a system/method for image recognition using a pruned convolutional neural network (CNN), in accordance with the present principles;

FIG. 3 is a block/flow diagram illustrating a system/method for pruning filters of a convolutional neural network (CNN), in accordance with the present principles;

FIG. 4 is a block/flow diagram illustrating a system/method for pruning filters of a simple convolutional neural network (CNN), in accordance with the present principles;

FIG. 5 is a block/flow diagram illustrating a system/method for pruning intermediate filters of a simple convolutional neural network (CNN), in accordance with the present principles;

FIG. 6 is a block/flow diagram illustrating a system/method for pruning intermediate filters of a residual convolutional neural network (CNN), in accordance with the present principles;

FIG. 7 is a block/flow diagram illustrating a high-level system/method for surveillance of forest fires with image recognition using a pruned convolutional neural network (CNN), in accordance with the present principles;

FIG. 8 is a block/flow diagram illustrating a high-level system/method for surveillance for vehicles with image recognition using a pruned convolutional neural network (CNN), in accordance with the present principles; and

FIG. 9 is a flow diagram illustrating a system/method for image recognition using a pruned convolutional neural network (CNN), in accordance with the present principles.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

In accordance with the present principles, systems and methods are provided for a convolutional neural network (CNN) trained with pruned filters for image recognition in surveillance applications.

In one embodiment, the number of filters in a CNN is reduced by pruning. This pruning is accomplished by training a CNN for image recognition in a surveillance application. Once trained, the filters of the CNN can be assessed by determining the weights of each filter. Filters with small weights contribute little to accuracy, and can therefore be removed, and thus pruned.

Once the filters have been pruned, the CNN can be retrained until it reaches its original level of accuracy. Thus, fewer filters are employed in a CNN that is equally accurate. By removing filters, the number of convolution operations is reduced, thus reducing computation costs, including computer resource requirements as well as power requirements. This pruning process also avoids the need to maintain sparse data structures or sparse convolution libraries because the filters having lower contributions are completely removed. As a result, the pruned CNN can be made efficient enough to be employed in embedded devices and mobile devices such as, e.g., digital cameras and camcorders, personal computers, tablets, smartphones, vehicles, drones, and satellites, among others. Predictions may be made of future situations and actions according to the recognized images. Thus, a surveillance system employing the pruned CNN can leverage the more efficient image recognition to achieve situation predictions earlier and more quickly so that more effective and better-informed actions may be proactively taken.

Embodiments described herein may be entirely hardware, entirely software, or may include both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, etc.

Each computer program may be tangibly stored in a machine-readable storage medium or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage medium or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.

Referring now in detail to the figures, in which like numerals represent the same or similar elements, and initially to FIG. 1, a high-level system/method for surveillance with image recognition using a pruned convolutional neural network (CNN) is illustratively depicted in accordance with one embodiment of the present principles.

In one embodiment, a system is contemplated for providing surveillance for an area of interest 140. The area of interest 140 may be, e.g., the interior or exterior of a building (for example, a shopping mall, airport, office building, etc.), a parking lot, the interior or exterior of a house, or any other public or private place for which surveillance may be desired. According to aspects of the present invention, surveillance may include image recognition to facilitate the recognition of and response to potentially dangerous or hazardous situations, such as, e.g., fires, floods, criminal activity, or any other natural or man-made condition in the area of interest 140.

Surveillance may be performed by an image capture device 100. The image capture device 100 is used to capture an image or sequence of images of the area of interest 140 such that the image or sequence of images may be analyzed to recognize or predict a hazardous situation. Accordingly, the image capture device 100 may include, e.g., a still or video camera, or any device incorporating a suitable sensor for capturing images (for example, e.g., a device including a charge-coupled device (CCD), a photodiode, a complementary metal oxide semiconductor (CMOS) sensor, an infrared sensor, or a light detection and ranging (LIDAR) sensor, among others).

The image capture device 100 may include a processing system 200 for performing the analysis, including image recognition. The processing system 200 may be an internal component of the image capture device 100, or may be external and in communication with the image capture device 100. Accordingly, the processing system 200 may include, e.g., a system-on-chip (SOC), a central processing unit (CPU), a graphics processing unit (GPU), a computer system, a network, or any other processing system.

The processing system 200 may include a computer processing device 210 for processing the image or sequence of images provided by the image capture device 100. Accordingly, the computer processing device 210 can include, e.g., a CPU, a GPU, or other processing device or combination of processing devices. According to aspects of an embodiment of the present invention, the processing system 200 is an embedded processing system including, e.g., a SOC, or a mobile device such as, e.g., a smartphone.

Such systems have strict power and resource constraints because they do not benefit from grid power or large packaging that provides a large volume of power, storage, or processing capability. Therefore, to perform the image recognition, the computer processing device 210 includes a pruned CNN 212. The pruned CNN 212 reduces the complexity of the CNN for image recognition by pruning particular portions of the CNN that have a relatively insignificant impact on the accuracy of the image recognition. Thus, pruning a CNN can reduce the resource requirements of the CNN by reducing complexity, while having minimal impact on the accuracy of the CNN. As a result, the pruned CNN 212 may be, e.g., stored in volatile memory, such as, e.g., dynamic random access memory (DRAM), processor cache, random access memory (RAM), or a storage device, among other possibilities. Accordingly, the pruned CNN 212 may be stored locally to an embedded or mobile processing system and provide accurate image recognition results.

It has been found that convolution operations contribute significantly towards overall computation. Indeed, the convolution operations themselves can contribute up to about 90.9% of the overall computation effort. Therefore, reducing the number of convolution operations through pruning will cause a corresponding reduction in the computing resources demanded by the operation of the CNN, such as, e.g., electrical power, processing time, memory usage, storage usage, and other resources.

Thus, according to aspects of the present invention, the pruning of the CNN can take the form, e.g., of pruning filters from convolution layers of the CNN. According to aspects of one possible embodiment, a fully trained CNN is tested at each layer to locate the filters with small weights. Small weights may be determined, e.g., based on their relation to other filter weights (e.g., a number of the smallest weighted filters), or by being below a certain threshold value, or by another suitable standard. Alternatively, or in addition, sensitivity to pruning may be a determining factor for whether a filter is pruned. Sensitivity may be assessed by, e.g., removing a filter and testing the effect of the removal on accuracy to determine if it exceeds a threshold value at which a filter is deemed too sensitive to pruning, and thus is replaced into the kernel matrix, or by comparing the weights of each kernel in the filter to a threshold magnitude; if there are too few kernel weights with a magnitude below the threshold, then the filter is deemed sensitive to pruning and should not be removed.
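For illustration only, the following is a minimal sketch of the two weight-based selection criteria described above, assuming a PyTorch Conv2d layer; the function names are hypothetical and form no part of the disclosed embodiments:

```python
import torch
import torch.nn as nn

def filter_magnitudes(conv: nn.Conv2d) -> torch.Tensor:
    # conv.weight has shape (out_channels, in_channels, kH, kW); each
    # output channel is one 3D filter, so sum |w| over all other dims.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def select_prune_indices(conv: nn.Conv2d, m=None, threshold=None):
    mags = filter_magnitudes(conv)
    if threshold is not None:
        # Value-based criterion: filters below the threshold are pruned.
        return torch.nonzero(mags < threshold).flatten().tolist()
    # Count-based criterion: the m smallest-weighted filters are pruned.
    return torch.argsort(mags)[:m].tolist()

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
print(select_prune_indices(conv, m=8))  # indices of the 8 weakest filters
```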

The filters with the small weights, or that are not sensitive to pruning, are then removed from the CNN. Thus, entire convolutional operations corresponding to the removed filters may be eliminated from the CNN. After removal, the CNN may then be retrained back to its original accuracy without the removed filters. Accordingly, the pruned CNN 212 has reduced resource requirements and can, therefore, be more effectively implemented in the processing system 200. The pruning and retraining of the CNN may take place, e.g., on a separate computing system before being implemented in the processing system 200, or it may be performed by the processing system 200 itself.

The image recognition results from the pruned CNN 212 can be combined with action prediction to predict changes to situations. Thus, the processing system 200 analyzes the recognized images from the pruned CNN 212 to predict future movement and actions of features in the image or images. Thus, dangerous and hazardous situations can be predicted and proactively addressed. The action prediction can be performed by the pruned CNN 212, or, e.g., by additional features such as, e.g., recurrent neural networks (RNNs) including long short-term memory networks, among other prediction solutions.

The results, including the image recognition results and any predictions, are communicated from the computer processing device 210 to a transmitter 220. The transmitter 220 may be included in the processing system 200, or it may be separate from the processing system 200. The transmitter 220 communicates the image recognition results to a remote location or device via a receiver 104 at that location or device. The communication may be wired or wireless, or by any suitable means to provide the image recognition results.

Upon receipt of the image recognition results, the receiver 104 communicates the results to a notification system 106. The notification system 106 provides a user with a notification of the recognized images, such as an alert of a hazardous situation (for example, an alarm, flashing lights, a visual message on a display, or an auditory message from a speaker, among others), or a display of the recognized images including, e.g., present or predicted labelled images, object lists, actions, or conditions. Accordingly, the notification system 106 may take the form of, e.g., a display, a speaker, an alarm, or combinations thereof. Thus, a user may be notified of the features that have been recognized in an image by the pruned CNN 212, and act accordingly.

Referring now to FIG. 2, a system/method for image recognition using a pruned convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

According to aspects of an embodiment of the present invention, image recognition for surveillance may be performed to predict future actions of people and things in an environment. The image recognition is performed on an input image sequence 530. The image sequence 530 is a series of still or video images of an environment being surveilled.

The image sequence 530 is received by a pruned CNN 500. The pruned CNN 500 may be implemented in a computing device that has few resources to expend on convolutions. As discussed above, this computing device can include, e.g., embedded systems or mobile devices, among others. Therefore, power and computing resources are limited. However, using one of these computing devices allows for the use of surveillance equipment that is mobile, fast, and cheap. For example, the surveillance equipment could include, e.g., a smartphone, a drone, a smart doorbell system, a digital camera or camcorder, or any other suitable device. Performing image recognition on-board the surveillance equipment, or on a similarly mobile, fast, and cheap device in communication with the surveillance equipment, permits image recognition on-site in locations where access to high computing power and large power sources is not feasible. Therefore, a pruned CNN 500 is implemented that has fewer computation requirements.

The pruned CNN 500 may be generated from a CNN trained by a CNN training module 300. The CNN training module 300 may be external to the surveillance equipment, such that upon training and pruning, the pruned CNN 500 is transferred to the surveillance equipment. However, the CNN training module 300 may alternatively be on-board the surveillance equipment.

The CNN training module 300 trains a CNN that can include any suitable CNN for image recognition, such as, e.g., a simple CNN including VGG-16, or a more complicated CNN including a residual CNN such as RESNET. The CNN may be trained using any suitable dataset, such as, e.g., CIFAR-10. The CNN training module 300 uses the dataset to input each image and, for each image, apply filters and generate intermediate feature maps through the depth of the CNN, testing the final feature maps and updating the weights of the filters. This process may be performed any suitable number of times until the CNN is trained.
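For illustration only, a minimal sketch of such a training step, assuming a torchvision CIFAR-10 pipeline and a VGG-16 model; the hyperparameters are illustrative assumptions rather than values from the disclosure:

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.vgg16(num_classes=10)  # a "simple CNN", per the text
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):  # repeated "any suitable number of times"
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # test the final feature maps
        loss.backward()                          # output errors update filter weights
        optimizer.step()
```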

The trained CNN is then provided to a pruning module 400. The pruning module 400 prunes the CNN such that fewer convolutional operations need to be performed to generate a final feature map of the image recognition. Accordingly, the pruning module 400 can, for example, remove filters from the CNN that are not significant to the image recognition task. A filter can be determined to not be significant because it is relatively rarely activated during image recognition. Thus, a filter that does not meet a threshold that represents the frequency of activation should be pruned. The threshold could be a weight value. In such a case, the weights of the filter are assessed and compared to the threshold. If the threshold is not met, then the filter is pruned from the CNN to reduce convolutional operations. Alternatively, the threshold could be based on a number of filters. In this case, the weights of the filters are assessed and the filters can be ranked according to their weights. A threshold number of the smallest filters can then be pruned from the CNN to reduce convolutional operations. Assessment of the weights can include any suitable assessment, such as, e.g., determining an average or a median weight for a filter, determining a sum of the weights, or determining an absolute sum of the weights for a filter, among other techniques for assessment.

Accordingly, filters are removed from the CNN, reducing the convolutional operations of the CNN, and thus reducing computational requirements. Upon pruning, the CNN may be returned to the CNN training module 300 to be retrained without the pruned filters. This retraining may be performed after each filter is removed, or it may be performed after pruning filters through the depth of the CNN. The CNN may be retrained until it reaches the original level of accuracy of the trained CNN prior to pruning.

Upon pruning the CNN, the pruning module 400 provides the pruned CNN 500 to the surveillance equipment for image recognition, as discussed above. The pruned CNN 500 receives the image sequence 530 and performs image recognition to recognize objects and conditions present in each image. The image recognition results may take the form of image feature vectors, each image feature vector containing information about objects and conditions in a given image. Accordingly, in one embodiment, each image feature vector can include, e.g., 2048 dimensions for robust feature determination corresponding to image recognition.

The image recognition can include predictions about the objects and conditions, such as future movement. These predictions assist a user with surveillance by predicting possible threats, hazards, or dangers. Therefore, the image feature vectors from the pruned CNN 500 can be provided to, e.g., a recurrent neural network such as, e.g., a long short-term memory network (LSTM) 510. The LSTM 510 will have been trained to use the image feature vectors to generate an action feature vector that can be, e.g., smaller than the image feature vector. In one embodiment, the action feature vector generated by the LSTM 510 can include, e.g., a 1024-dimension action feature vector. The LSTM 510 generates, with the action feature vector, predictions of future actions of the objects and conditions in the image sequence 530.

In one embodiment, the action feature vector can then be processed to generate probabilities of predicted actions. Accordingly, an action net 520 is employed to receive the action feature vector. The action net 520 will have been trained to use the action feature vector to generate n class output probabilities that form the probabilities of predicted actions, to generate a prediction 600 corresponding to the surveillance.
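A minimal sketch of this prediction head, assuming the dimensions given above (2048-dimension image feature vectors, 1024-dimension action feature vectors, n output classes); the module names are hypothetical:

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        # Image feature vectors (2048-d) in, action feature vectors (1024-d) out.
        self.lstm = nn.LSTM(input_size=2048, hidden_size=1024, batch_first=True)
        # The "action net" maps action features to n class probabilities.
        self.action_net = nn.Linear(1024, n_classes)

    def forward(self, image_features):  # (batch, seq_len, 2048)
        action_features, _ = self.lstm(image_features)    # (batch, seq_len, 1024)
        logits = self.action_net(action_features[:, -1])  # predict from last step
        return torch.softmax(logits, dim=-1)              # n class output probabilities

preds = ActionPredictor(n_classes=10)(torch.randn(4, 16, 2048))
```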

Referring now to FIG. 3, a system/method for pruning filters of a convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

According to aspects of an embodiment of the present invention, the pruning module 400 prunes filters from a trained CNN based on, e.g., a sum of absolute kernel weights in the filter. The filters are pruned by first inputting the trained CNN 301. The trained CNN 301 includes initial, intermediate, and final feature maps with associated trained filters and kernel matrices. In training, weights of the kernel matrices are determined by inputting training images, applying the existing filters with kernel matrices, and updating the weights of the kernel matrices based on an error of the output.

Pruning unnecessary filters facilitates reducing convolutional operations of the trained CNN 301. Therefore, unnecessary filters are determined by a filter significance module 411 of the pruning module 400.

The filter significance module 411 includes an absolute kernel weight summer 411. The absolute kernel weight summer 411 calculates the absolute kernel weights of a filter, and sums the absolute kernel weights. The sum of the absolute kernel weights is representative of the magnitude of the filter to which it applies. Therefore, a lower absolute kernel weight sum will indicate that a filter has less effect on the image recognition process, and is thus activated less frequently. Accordingly, the lower the absolute kernel weight sum of a filter, the lower the significance of that filter.

Furthermore, sensitivity to pruning can be a determining factor for whether a filter is pruned. Sensitivity may be assessed by, e.g., removing a filter and testing the effect of the removal on accuracy to determine if it exceeds a threshold value at which a filter is deemed too sensitive to pruning, and thus is replaced into the kernel matrix, or by comparing the weights of each kernel in the filter to a threshold magnitude; if there are too few kernel weights with a magnitude below the threshold, then the filter is deemed sensitive to pruning and should not be removed. Other methods of determining sensitivity to pruning are also contemplated.

According to aspects of the present invention, based on the absolute kernel weight sums calculated by the absolute kernel weight summer 411, the filters of a given set of feature maps can be ranked by the rank determination unit 412. The rank of a given filter may be determined by the magnitude of the absolute kernel weight sum pertaining to that filter, with the ranks forming, e.g., a spectrum of filter magnitudes from least to greatest. However, other ranking methodologies are contemplated.

The ranked filters may then be analyzed by the filter pruning module 420. The filter pruning module 420 includes a threshold comparison unit 421. This threshold comparison unit 421 is configured to compare the absolute kernel weight sums for each of the filters with a threshold. The threshold can be predetermined or can change according to a distribution of absolute kernel weight sums. For example, the threshold can include a threshold absolute kernel weight sum. In this embodiment, the absolute kernel weight sums of each filter are compared with the threshold value, and if the absolute kernel weight sum of a given filter is below the threshold value, then the filter is removed. Alternatively, the threshold can be a threshold number of filters. Thus, the threshold comparison unit 421 may compare the number of filters that exist in the trained CNN 301 at a given convolutional layer with a threshold number of filters. A difference, m, may then be taken between the threshold value and the number of filters, with the difference, m, indicating the number of filters to be removed. Then, the m smallest filters are removed. However, this threshold value may instead indicate a set number of the smallest filters to be removed. For example, rather than a difference, m may be a predetermined number of the smallest filters to be removed from the convolutional layer.

Once filters of a given convolutional layer are removed, the CNN can be updated to reflect the removal. Thus, a layer updater 422 will update the trained CNN 301 by removing the kernels of the subsequent convolutional layer that correspond to the removed filters. The kernel matrices for the given and the subsequent convolutional layers can then be reconstructed to reflect the removed filters and removed kernels.
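A minimal sketch of such a layer update, assuming consecutive PyTorch Conv2d layers: the pruned output channels are dropped from one layer and the matching input channels from the next, and both layers are rebuilt. The helper name is hypothetical, and the caller is assumed to swap the returned layers back into the network before retraining:

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv: nn.Conv2d, next_conv: nn.Conv2d, prune_idx):
    keep = [i for i in range(conv.out_channels) if i not in set(prune_idx)]
    # Reconstruct the given layer without the pruned filters (output channels).
    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    # Reconstruct the subsequent layer without the kernels that consumed
    # the removed feature maps (its input channels).
    new_next = nn.Conv2d(len(keep), next_conv.out_channels, next_conv.kernel_size,
                         stride=next_conv.stride, padding=next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next
```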

Each convolutional layer of the trained CNN 301 will be processed by the pruning module 400 as described above. The CNN can be retrained iteratively after each layer is pruned, or all of the layers may be pruned before retraining the CNN. However, in either case, upon pruning all of the layers, the pruning module 400 will output a pruned CNN 500 to be used for image recognition. The pruned CNN 500 will be significantly less complex than the original trained CNN 301, while maintaining substantially the same level of accuracy because only the filters insignificant to the image recognition task are removed. Thus, the pruned CNN 500 can be incorporated, as discussed above, into a system having low computational or power resources, such as, e.g., embedded or mobile systems. Additionally, the pruned CNN 500 will also be faster than the original trained CNN 301 due to the reduction in convolutional operations. Thus, even if used in a system having high resources, the pruned CNN 500 is still beneficial because it is more efficient and can therefore perform the surveillance image recognition task more quickly, providing benefits even in applications where resource constraints are not a concern.

Referring now to FIG. 4, a system/method for pruning filters of a simple convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

According to an embodiment of the present invention, the pruning of a filter 13 from a kernel matrix 10 is illustrated. Pruning includes a set of input feature maps 1 and intermediate and/or output feature maps 2 and 3. The input feature maps 1 have filters applied with kernel matrix 10 to generate feature maps 2. Similarly, feature maps 2 have filters applied with kernel matrix 20 to generate feature maps 3.

A filter 13 may be pruned using the absolute kernel weight sums from kernel matrix 10, where each box in the kernel matrix 10 corresponds to a kernel of a filter. A vertical axis of the kernel matrix 10 is the input channels 11, while the horizontal axis is the output channels 12. Thus, each column of kernel matrix 10 forms a 3D filter for generating the feature maps 2.

To determine the magnitude, and thus significance, of a filter, the absolute kernel magnitudes of each kernel in a given column are summed. Each filter can then be compared to a threshold, as discussed above. The comparison can result in a pruned filter 13. The pruned filter 13 may be pruned because the sum of absolute kernel weights of the kernels for that filter is below a threshold, or because it has a lower sum compared to the other filters in the kernel matrix 10. Because the pruned filter 13 is removed, the corresponding feature map 23 is also removed from the subsequent layer of feature maps 2.

In the kernel matrix 20, as with the kernel matrix 10, the vertical axis relates to an input channel 22. Because the input channel 22 of this subsequent kernel matrix 20 corresponds to the output channel 12 of the previous kernel matrix 10, the removal of the filter 13 is reflected in kernel matrix 20 by the removal of a row that corresponds to the column of the pruned filter 13 in kernel matrix 10 and the feature map 23.

Accordingly, the pruning of a filter 13 results in the removal of the corresponding kernels of kernel matrix 10, as well as the subsequent corresponding feature map 23 and the kernels in kernel matrix 20 corresponding to the pruned filter 13.

Referring now to FIG. 5, a system/method for pruning intermediate filters of a simple convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

According to an embodiment of the present invention, the pruning of a filter 33 from a kernel matrix 30 is illustrated. Pruning includes a set of intermediate feature maps 4 and subsequent intermediate and/or output feature maps 5. The intermediate feature maps 4 have filters applied with kernel matrix 30 to generate subsequent feature maps 5.

In this example, feature map 44 has already been pruned from intermediate feature maps 4. Accordingly, as discussed above, the feature maps 4 being input into input channel 31 results in a row of kernels for filter 34 in kernel matrix 30 being removed corresponding to the removed feature map 44.

The output channel 32 is then pruned by determining the absolute kernel weight sum for each filter (column) in the kernel matrix 30. Because the filter 34 has been removed from the kernel matrix 30, the weights of the kernels in that filter are not considered for the calculation of absolute kernel weight sums. The result of the calculation of absolute kernel weight sums permits the comparison of each filter with a threshold, as discussed above. As a result, filter 33 may be pruned as being below the threshold.

Because filter 33 for the output channel 32 has been removed, the corresponding feature map 53 of subsequent feature maps 5 will also be removed. Thus, a feature map from each of feature maps 4 and 5 is removed according to the pruning process, thus reducing convolutional operations accordingly, and reducing the resource requirements of employing the CNN.

Referring now to FIG. 6, a system/method for pruning intermediate filters of a residual convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

According to an embodiment of the present invention, a more complicated CNN may be pruned, such as, e.g., a residual CNN (for example, RESNET). Such a residual CNN will include a projection shortcut kernel matrix 60 and a residual block kernel matrix 40. Intermediate feature maps 6 may include an already pruned feature map 63. To prune subsequent filters, the feature maps 6 may be filtered with both the projection shortcut kernel matrix 60 and the residual block kernel matrix 40 independently, thus creating two separate sets of feature maps: projection shortcut feature maps 8 b and residual block feature maps 7.

The filters of kernel matrix 40 may be pruned to remove pruned filters 43 according to the systems and methods discussed above. As a result, the corresponding feature maps 73 are removed from feature maps 7. However, filters of a subsequent kernel matrix 50 are not pruned according to a threshold comparison of the absolute kernel weight sums of filters in the kernel matrix 50. Rather, pruned filters 53 are determined according to an analysis of the projection shortcut kernel matrix 60. Accordingly, the filters of the projection shortcut kernel matrix 60 undergo absolute kernel weight summation and comparison to a threshold according to aspects of the invention described above. As a result, pruned filters 53 are determined to be pruned from the projection shortcut kernel matrix 60.

Therefore, corresponding projection shortcut feature maps 8 b are generated in parallel to the residual block feature maps 7 and reflect the removal of pruned filters 53. As a result of the removal of pruned filters 53, the corresponding projection shortcut feature maps 83 b are removed from the projection shortcut feature maps 8 b. The projection shortcut feature maps 8 b are used to determine pruned filters 53 in the subsequent residual block kernel matrix 50. Accordingly, the resulting subsequent residual block feature maps 8 a are updated to reflect the removal of pruned subsequent residual feature maps 83 a corresponding to the pruned filters 53.
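One reading of this rule, sketched below under the assumption of PyTorch weight tensors: the prune indices are chosen from the projection shortcut's weights, and the same output channels are then removed from the residual block's last convolution so that the two branches stay aligned for the residual addition. This is an interpretation of the figure, and the helper name is hypothetical:

```python
import torch

def shortcut_guided_prune_indices(shortcut_weight: torch.Tensor, m: int):
    # shortcut_weight: (out_channels, in_channels, kH, kW) projection kernels.
    mags = shortcut_weight.detach().abs().sum(dim=(1, 2, 3))
    prune_idx = torch.argsort(mags)[:m].tolist()  # weakest shortcut filters
    # The same indices are then applied to the residual block's last conv
    # (kernel matrix 50 in FIG. 6), keeping both branches' channels aligned.
    return prune_idx
```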

Referring now to FIG. 7, a high-level system/method for surveillance of forest fires with image recognition using a pruned convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

In one embodiment, a system is contemplated for providing surveillance for a forest 740. According to aspects of the present invention, surveillance may include image recognition to facilitate the recognition and prediction of potentially dangerous or hazardous situations 742, such as, e.g., fires, floods, or any other natural or man-made condition in the forest 740.

Surveillance may be performed by one or more image capture devices 701. The image capture devices 701 are used to capture an image or sequence of images from a remote location of the forest 740 such that the image or sequence of images may be analyzed to recognize and predict a hazardous situation, such as, e.g., a wildfire 742. Accordingly, the image capture devices 701 may each include, e.g., a still or video camera mounted in a remote location, such as, e.g., a fixed tower or an autonomous or remotely operated drone. The image capture devices 701 can also include any other device incorporating a suitable sensor for capturing images (for example, e.g., a device including a charge-coupled device (CCD), a photodiode, a complementary metal oxide semiconductor (CMOS) sensor, an infrared sensor, or a light detection and ranging (LIDAR) sensor, among others). The remote image capture devices 701 can, therefore, include batteries for providing power to components.

The image capture devices 701 may each include a processing system 700 for performing the analysis, including image recognition. The processing system 700 may be an internal component of the image capture device 701, or may be external and in communication with the image capture device 701. Accordingly, the processing system 700 may include, e.g., a system-on-chip (SOC), a central processing unit (CPU), a graphics processing unit (GPU), a computer system, a network, or any other processing system.

The processing system 700 may include a computer processing device 710 for processing the image or sequence of images provided by the image capture devices 701. Accordingly, the computer processing device 710 can include, e.g., a CPU, a GPU, or other processing device or combination of processing devices. According to aspects of an embodiment of the present invention, the processing system 700 is an embedded processing system including, e.g., a SOC.

Such systems have strict power and resource constraints because they do not benefit from grid power or large packaging that provides a large volume of power, storage, or processing capability. Therefore, to perform the image recognition, the computer processing device 710 includes a pruned CNN 712. The pruned CNN 712 reduces the complexity of the CNN for image recognition by pruning particular portions of the CNN that have a relatively insignificant impact on the accuracy of the image recognition. Thus, pruning a CNN can reduce the resource requirements of the CNN by reducing complexity, while having minimal impact on the accuracy of the CNN. As a result, the pruned CNN 712 may be, e.g., stored in volatile memory, such as, e.g., dynamic random access memory (DRAM), processor cache, random access memory (RAM), or a storage device, among other possibilities. Accordingly, the pruned CNN 712 may be stored locally to an embedded or mobile processing system and provide accurate image recognition results.

The image recognition results from the pruned CNN 712 can be combined with action prediction to predict changes to situations. Thus, the processing system 700 analyzes the recognized images from the pruned CNN 712 to predict future movement and actions of features in the image or images. Thus, forest fires, for example, can be predicted and proactively addressed. The action prediction can be performed by the pruned CNN 712, or, e.g., by additional features such as, e.g., recurrent neural networks (RNNs) including long short-term memory networks, among other prediction solutions.

The results, including the image recognition results and any predictions, can be communicated from the computer processing system 710 to a transmitter 720. The transmitter 720 may be included in the processing system 700, or it may be separate from the processing system 700. The transmitter 720 communicates the image recognition results to a remote location or device via a receiver 722 at that location or device. The communication may be wired or wireless, or by any suitable means to provide the image recognition results.

Upon receipt of the image recognition results, the receiver 722 communicates the results to a notification system 730. The notification system 730 provides a user with a notification of situations predicted with the recognized images, such as an alert of a hazardous situation such as, e.g., a wildfire 742, by, e.g., a display of the recognized images and predicted situations including, e.g., present or predicted labelled images, object lists, actions, or conditions. Accordingly, the notification system 730 may take the form of, e.g., a display, a speaker, an alarm, or combinations thereof. Thus, a user may be notified of the features that have been recognized and predicted from an image by the pruned CNN 712, and act accordingly.

Moreover, multiple remote image capture devices 701 can be deployed to perform image recognition and prediction in a single area. The results from each of the image capture devices 701 can then be cross-referenced to determine a level of confidence in the results. Thus, the accuracy of the image recognition and prediction can be improved using multiple remote image capture devices 701.

Referring now to FIG. 8, a high-level system/method for surveillance for vehicles with image recognition using a pruned convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

In one embodiment, a system is contemplated for providing surveillance for a vehicle 830, such as, e.g., an autonomous vehicle, a semi-autonomous vehicle, an autonomous drone, or a user operated vehicle, among others. According to aspects of the present invention, surveillance may include image recognition to facilitate the recognition of and response to potentially dangerous or hazardous situations, such as, e.g., pedestrians 840, obstacles, other vehicles, or lane lines, among others.

Surveillance may be performed by an image capture device 801. The image capture device 801 is used to capture an image or sequence of images from the vehicle 830 such that the image or sequence of images may be analyzed to recognize or predict route obstacles. Accordingly, the image capture device 801 may include, e.g., a still or video camera, or any device incorporating a suitable sensor for capturing images (for example, e.g., a device including a charge-coupled device (CCD), a photodiode, a complementary metal oxide semiconductor (CMOS) sensor, an infrared sensor, or a light detection and ranging (LIDAR) sensor, among others). Moreover, the image capture device 801 can be included with the vehicle 830, or it can be a separate device that is mountable to the vehicle 830. In any case, the image capture device 801 can include a power source, such as, e.g., a battery, or it can be powered by a battery included in the vehicle 830.

The image capture device 801 may include a processing system 800 for performing the analysis, including image recognition. The processing system 800 may be an internal component of the image capture device 801, or may be external and in communication with the image capture device 801. Accordingly, the processing system 800 may include, e.g., a system-on-chip (SOC), a central processing unit (CPU), a graphics processing unit (GPU), a computer system, a network, or any other processing system.

The processing system 800 may include a computer processing device 810 for processing the image or sequence of images provided by the image capture device 801. Accordingly, the computer processing device 810 can include, e.g., a CPU, a GPU, or other processing device or combination of processing devices. According to aspects of an embodiment of the present invention, the processing system 800 is an embedded processing system including, e.g., a SOC, or a mobile device such as, e.g., a smartphone.

Such systems have strict power and resource constraints because they do not benefit from grid power or large packaging that provides a large volume of power, storage, or processing capability. Therefore, to perform the image recognition, the computer processing device 810 includes a pruned CNN 812. The pruned CNN 812 reduces the complexity of the CNN for image recognition by pruning particular portions of the CNN that have a relatively insignificant impact on the accuracy of the image recognition. Thus, pruning a CNN can reduce the resource requirements of the CNN by reducing complexity, while having minimal impact on the accuracy of the CNN. As a result, the pruned CNN 812 may be, e.g., stored in volatile memory, such as, e.g., dynamic random access memory (DRAM), processor cache, random access memory (RAM), or a storage device, among other possibilities. Accordingly, the pruned CNN 812 may be stored locally to an embedded or mobile processing system and provide accurate image recognition results.

The image recognition results from the pruned CNN 812 can be combined with action prediction to predict changes to situations. Thus, the processing system 800 analyzes the recognized images from the pruned CNN 812 to predict future movement and actions of features in the image or images. Thus, obstacles in the road, such as, e.g., pedestrians, downed trees, animals, other vehicles, etc., can be predicted and proactively addressed. The action prediction can be performed by the pruned CNN 812, or, e.g., by additional features such as, e.g., recurrent neural networks (RNNs) including long short-term memory networks, among other prediction solutions.

The results, including the image recognition results and any predictions, can be communicated from the computer processing system 810 to a transmitter 820. The transmitter 820 may be included in the processing system 800, or it may be separate from the processing system 800. The transmitter 820 communicates the image recognition results to a receiver 804. The communication may be wired or wireless, or by any suitable means to provide the image recognition results. The receiver 804 may be external to the vehicle 830, or it may be a part of the vehicle 830. Thus, the computer processing system 810 can send the results to, e.g., a remote operator, a local operator, or the vehicle 830 itself.

Upon receipt of the image recognition results, the receiver 804 can communicate the results to a notification system 806. The notification system 806 provides a user with a notification of the recognized images, such as a pedestrian 840, by, e.g., a display of the recognized images and predicted obstacles including, e.g., present or predicted labelled images, object lists, actions, or conditions. Accordingly, the notification system 806 may take the form of, e.g., a display, a speaker, an alarm, or combinations thereof. Thus, a user may be notified of the features that have been recognized and the obstacles predicted by the pruned CNN 812, and act accordingly. The notification system 806 may be at a remote location for a remote device operator, or it may be internal to the vehicle, such as in the case of notifying an operator of the vehicle. Additionally, the processing system 800 of the vehicle 830 can be programmed to automatically take action to avoid a recognized or predicted obstacle by, e.g., turning, stopping, or taking any other suitable evasive action.

Referring now to FIG. 9, a flow diagram illustrating a system/method for image recognition using a pruned convolutional neural network (CNN) is illustratively depicted in accordance with an embodiment of the present principles.

At block 901, feature maps and corresponding convolution filters are extracted from a convolutional neural network (CNN) for surveillance image recognition.

The CNN is a fully trained CNN for image recognition. Through the training and testing process of the CNN, the convolutional layers can be extracted. Each convolutional layer includes a set of feature maps with a corresponding kernel matrix. The feature maps have corresponding filters that influence the generation of the feature maps according to the kernel weights for each filter. As discussed above, a filter may be a 3-dimensional (3D) filter having a kernel at each input channel of a kernel matrix. Thus, a set of kernels, one from each input channel, forms a filter for an output channel of the kernel matrix. The filter is used to generate an output feature map. The kernel matrix includes a plurality of filters, each corresponding to a particular feature map.

At block 902, absolute magnitude kernel weights are summed for each convolutional filter.

As discussed above, each filter includes a kernel from each input channel of the kernel matrix. The magnitude of a filter can be represented by calculating the absolute kernel weight of each kernel, and then summing those weights for each filter. The magnitude of a filter is indicative of the frequency with which the filter is activated for image recognition.
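In other words, writing $\mathcal{K}_{i,j}$ for the kernel connecting input channel $i$ to output channel $j$ (notation introduced here for illustration; it does not appear in the figures), the magnitude of the $j$-th filter over $n_{\text{in}}$ input channels is

$$s_j = \sum_{i=1}^{n_{\text{in}}} \sum_{u,v} \left| \mathcal{K}_{i,j}(u,v) \right|,$$

where $(u,v)$ ranges over the spatial positions of each kernel. Filters with the smallest $s_j$ are the candidates for pruning.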

At block 903, low magnitude convolutional filters and corresponding feature maps are removed according to the absolute magnitude kernel weight sums.

Because convolutional operations are very computationally costly, the removal of convolutional filters, which reduces convolutional operations, can reduce the computation cost of the CNN. Therefore, upon determining the magnitude of each filter in a layer, filters with low magnitudes can be removed.

The filters to be removed can be determined by a comparison with a threshold value. For example, filters having a magnitude below a certain threshold can be removed. As another example, a certain number of the smallest filters can be removed. Other possible thresholds are contemplated.

Alternatively, or in addition, sensitivity to pruning may be a determining factor for whether a filter is pruned. Sensitivity may be assessed by, e.g., removing a filter and testing the effect of the removal on accuracy to determine if it exceeds a threshold value at which a filter is deemed too sensitive to pruning, and thus is replaced into the kernel matrix, or by comparing the weights of each kernel in the filter to a threshold magnitude; if there are too few kernel weights with a magnitude below the threshold, then the filter is deemed sensitive to pruning and should not be removed.

Because a filter is removed, the corresponding output feature map from that filter should be removed as well. Thus, the feature map set at a given layer is reduced in size. This reduction results in a corresponding reduction in resource requirements for implementing the CNN.

At block 904, kernels of a subsequent convolutional layer corresponding to the removed feature maps are removed.

The removal of a particular filter and feature map affects subsequent kernel matrices. In particular, a filter, including a set of kernels for an output channel in the kernel matrix of a given layer, that is removed from the kernel matrix results in the corresponding kernels being absent from the input channel of the kernel matrix in the subsequent layer. As a result, the subsequent kernel matrix should be updated with the kernels of the removed filter pruned, so that the pruning of the subsequent kernel matrix is not influenced by the removed kernels.

At block 905, the CNN is retrained with updated filters.

Because filters have been removed, the CNN can be retrained to compensate for the absent filters. Because the removed filters have low impact on the image recognition task, as indicated by their low magnitudes, the CNN can be retrained without the filters to regain the original accuracy of the trained CNN prior to pruning. Thus, a CNN may be pruned to reduce computation resources by removing convolutional operations associated with filter operations while not negatively impacting the accuracy of the CNN.
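A minimal sketch of this retraining step, assuming a PyTorch model, a training data loader, and a caller-supplied `evaluate` function returning held-out accuracy; all names and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

def retrain(model: nn.Module, loader, epochs: int = 1):
    # Fine-tune the pruned CNN on the original training data.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()

def retrain_until_recovered(model, loader, evaluate, baseline_acc, max_rounds=10):
    # Keep fine-tuning until the pruned CNN regains its pre-pruning accuracy.
    for _ in range(max_rounds):
        retrain(model, loader)
        if evaluate(model) >= baseline_acc:
            break
    return model
```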

The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired to be protected by Letters Patent is set forth in the appended claims.

What is claimed is:
1. A method for pruning a convolutional neural network (CNN) for surveillance with image recognition, the method comprising: extracting at least one convolutional layer from a trained CNN, each convolutional layer including a kernel matrix having at least one filter formed in a corresponding output channel of the kernel matrix, and a feature map set having a feature map corresponding to each of the at least one filter; determining an absolute kernel weight for each kernel in the kernel matrix; summing the absolute kernel weights of each kernel in each of the at least one filter to determine a magnitude of each filter; comparing the magnitude of each filter with a threshold and removing one or more filters that are below the threshold; removing a feature map corresponding to each of the removed filters to prune the CNN of filters; and retraining the CNN upon pruning the removed filters to generate a pruned CNN having fewer convolutional layers to efficiently recognize and predict conditions in an environment being surveilled.
2. The method as recited in claim 1, further comprising removing the kernels of the removed filter from subsequent kernel matrices.
3. The method as recited in claim 2, further comprising, upon removing the kernels of the removed filter from subsequent kernel matrices, pruning the subsequent kernel matrices.
4. The method as recited in claim 1, further comprising iteratively retraining the CNN upon pruning each convolutional layer.
5. The method as recited in claim 1, further comprising retraining the CNN upon pruning every convolutional layer.
6. The method as recited in claim 1, wherein the threshold is a value corresponding to a minimum absolute kernel weight sum.
7. The method as recited in claim 1, wherein the threshold is a value corresponding to a number of filters having the smallest absolute kernel weight sums to be removed.
8. A non-transitory computer readable storage medium comprising a computer readable program for surveillance with image recognition using a pruned convolutional neural network (CNN), wherein the computer readable program when executed on a computer causes the computer to perform the steps of: extracting at least one convolutional layer from a trained CNN, each convolutional layer including a kernel matrix having at least one filter formed in a corresponding output channel of the kernel matrix, and a feature map set having a feature map corresponding to each of the at least one filter; determining an absolute kernel weight for each kernel in the kernel matrix; summing the absolute kernel weights of each kernel in each of the at least one filter to determine a magnitude of each filter; comparing the magnitude of each filter with a threshold and removing one or more filters that are below the threshold; removing a feature map corresponding to each of the removed filters to prune the CNN of filters; and retraining the CNN upon pruning the removed filters to generate a pruned CNN having fewer convolutional layers to efficiently recognize and predict conditions in an environment being surveilled.
9. The computer readable program as recited in claim 8, further comprising removing the kernels of the removed filter from subsequent kernel matrices.
10. The computer readable program as recited in claim 9, further comprising, upon removing the kernels of the removed filter from subsequent kernel matrices, pruning the subsequent kernel matrices.
11. The computer readable program as recited in claim 8, further comprising iteratively retraining the CNN upon pruning each convolutional layer.
12. The computer readable program as recited in claim 8, further comprising retraining the CNN upon pruning every convolutional layer.
13. The computer readable program as recited in claim 8, wherein the threshold is a value corresponding to a minimum absolute kernel weight sum.
14. The computer readable program as recited in claim 8, wherein the threshold is a value corresponding to a number of filters having the smallest absolute kernel weight sums to be removed.
15. An image recognition system for surveillance, the system comprising: an image capture device for capturing images of an environment to be surveilled; an image recognition system in an embedded computing device included in the image capture device configured to perform image recognition with a pruned CNN, the image recognition system including: an absolute kernel weight summer configured to determine an absolute kernel weight for each kernel in a kernel matrix and sum the absolute kernel weights of each kernel in a filter corresponding to each of at least one output channel of the kernel matrix to determine a magnitude of each filter, each filter corresponding to a feature map; a threshold comparison unit configured to compare the magnitude of each filter with a threshold and remove one or more filters that are below the threshold; a layer updater configured to remove a feature map corresponding to each of the removed filters to prune the CNN of filters and generate the pruned CNN; a long short-term memory network (LSTM) for predicting feature actions; an action network for generating class probabilities of feature actions; and a notification device for notifying a user of the class probabilities.
16. The system as recited in claim 15, wherein the layer updater is further configured to remove the kernels of the removed filter from subsequent kernel matrices.
17. The system as recited in claim 15, wherein the image recognition system is further configured to iteratively retrain the CNN upon pruning each convolutional layer.
18. The system as recited in claim 15, wherein the image recognition system is further configured to retrain the CNN upon pruning every convolutional layer.
19. The system as recited in claim 15, wherein the threshold is a value corresponding to a minimum absolute kernel weight sum.
20. The system as recited in claim 15, wherein the threshold is a value corresponding to a number of filters having the smallest absolute kernel weight sums to be removed.